transductive zero-shot learning
Transductive Zero-Shot Learning with Visual Structure Constraint
To recognize objects of the unseen classes, most existing Zero-Shot Learning (ZSL) methods first learn a compatible projection function between the common semantic space and the visual space based on the data of source seen classes, then directly apply it to the target unseen classes. However, in real scenarios, the data distributions of the source and target domains might not match well, which causes the well-known domain shift problem. Based on the observation that visual features of test instances can be separated into different clusters, we propose a new visual structure constraint on class centers for transductive ZSL, to improve the generality of the projection function (i.e., alleviate the above domain shift problem). Specifically, three different strategies (symmetric Chamfer distance, bipartite matching distance, and Wasserstein distance) are adopted to align the projected unseen semantic centers and visual cluster centers of test instances. We also propose a new training strategy to handle the real cases where many unrelated images exist in the test dataset, which is not considered in previous methods. Experiments on many widely used datasets demonstrate that the proposed visual structure constraint can bring substantial performance gain consistently and achieve state-of-the-art results.
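To make two of the alignment strategies named in the abstract concrete, here is a minimal sketch that is not taken from the paper. It assumes squared Euclidean distance between centers, equal numbers of projected semantic centers and visual cluster centers, and toy 2-D points. The function names are hypothetical. The brute-force matching is only practical for a handful of classes; a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice at scale.

```python
import itertools
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points (an assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def symmetric_chamfer(projected, clusters):
    """Sum of nearest-neighbour distances in both directions."""
    forward = sum(min(sq_dist(p, c) for c in clusters) for p in projected)
    backward = sum(min(sq_dist(c, p) for p in projected) for c in clusters)
    return forward + backward


def bipartite_matching(projected, clusters):
    """Minimum-cost one-to-one assignment, found by trying every permutation."""
    assert len(projected) == len(clusters)
    best_cost, best_perm = math.inf, None
    for perm in itertools.permutations(range(len(clusters))):
        cost = sum(sq_dist(p, clusters[j]) for p, j in zip(projected, perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm


# Toy data: three projected semantic centers and three visual cluster centers.
projected = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
clusters = [(0.1, 0.9), (0.1, 0.1), (0.9, 0.1)]

chamfer = symmetric_chamfer(projected, clusters)
match_cost, match_perm = bipartite_matching(projected, clusters)
print(chamfer, match_cost, match_perm)
```

The Chamfer variant lets several semantic centers share one cluster, whereas the bipartite variant forces a one-to-one assignment; on this toy data both pick the same pairs.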
Reviews: Transductive Zero-Shot Learning with Visual Structure Constraint
Strength: - The paper proposes an interesting and novel approach for transductive zero-shot learning. It would be great to also include zero-shot performance on ImageNet (this is most likely missing as there are no attribute annotations for ImageNet, but the approach does not seem to be limited to attributes for transfer) 1.2. It would be interesting to quantitatively compare to [31] and [34] as ablations of the authors' approach, from which the authors took inspiration. The authors claim in the reproducibility checklist to have "Clearly defined error bars" and "A description of results with central tendency (e.g. The paper does not discuss (qualitatively or quantitatively) recent related work including [A].
Reviews: Transductive Zero-Shot Learning with Visual Structure Constraint
The submission originally received mixed scores that put it in the borderline region. The reviewers praised the simple and apparently effective method, but also noted a number of issues, in particular an unclear relation to [34] (which itself is rather unclear) as well as an insufficient experimental evaluation. In their response the authors provided additional information and results, which the reviewers appreciated. A detailed discussion followed that ultimately led to the conclusion that the contribution is valuable and that the authors should not be punished for a lack of clarity in the prior work [34]. Therefore, the recommendation is to accept the work.
Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning
Menghini, Cristina, Delworth, Andrew, Bach, Stephen H.
Fine-tuning vision-language models (VLMs) like CLIP to downstream tasks is often necessary to optimize their performance. However, a major obstacle is the limited availability of labeled data. We study the use of pseudolabels, i.e., heuristic labels for unlabeled data, to enhance CLIP via prompt tuning. Conventional pseudolabeling trains a model on labeled data and then generates labels for unlabeled data. VLMs' zero-shot capabilities enable a "second generation" of pseudolabeling approaches that do not require task-specific training on labeled data. By using zero-shot pseudolabels as a source of supervision, we observe that learning paradigms such as semi-supervised, transductive zero-shot, and unsupervised learning can all be seen as optimizing the same loss function. This unified view enables the development of versatile training strategies that are applicable across learning paradigms. We investigate them on image classification tasks where CLIP exhibits limitations, by varying prompt modalities, e.g., textual or visual prompts, and learning paradigms. We find that (1) unexplored prompt tuning strategies that iteratively refine pseudolabels consistently improve CLIP accuracy, by 19.5 points in semi-supervised learning, by 28.4 points in transductive zero-shot learning, and by 15.2 points in unsupervised learning, and (2) unlike conventional semi-supervised pseudolabeling, which exacerbates model biases toward classes with higher-quality pseudolabels, prompt tuning leads to a more equitable distribution of per-class accuracy. The code to reproduce the experiments is at github.com/BatsResearch/menghini-enhanceCLIPwithCLIP-code.
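As an illustration of zero-shot pseudolabeling, the sketch below picks, for each class, the K unlabeled examples with the highest zero-shot score. This is one plausible selection rule and not necessarily the one the authors use. The score matrix is made up, and `top_k_pseudolabels` is a hypothetical helper, not part of the linked repository.

```python
def top_k_pseudolabels(scores, k):
    """Return {class index: indices of the k highest-scoring examples}.

    scores[i][c] is an assumed zero-shot score of example i for class c,
    e.g. an image-text similarity. An example may be picked for more than
    one class under this simple rule.
    """
    num_classes = len(scores[0])
    selected = {}
    for c in range(num_classes):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i][c], reverse=True)
        selected[c] = ranked[:k]
    return selected


# Toy scores for five unlabeled examples and two classes.
scores = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.30, 0.70],
    [0.40, 0.60],
    [0.55, 0.45],
]

pseudolabels = top_k_pseudolabels(scores, k=2)
print(pseudolabels)
```

Selecting a fixed number per class, instead of thresholding on confidence, keeps the pseudolabeled set balanced across classes, which matters when some classes receive systematically lower scores.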
Transductive Zero-Shot Learning using Cross-Modal CycleGAN
Bordes, Patrick, Zablocki, Eloi, Piwowarski, Benjamin, Gallinari, Patrick
In Computer Vision, Zero-Shot Learning (ZSL) aims at classifying unseen classes -- classes for which no matching training image exists. Most ZSL works learn a cross-modal mapping between images and class labels for seen classes. However, the data distributions of seen and unseen classes might differ, causing a domain shift problem. Following this observation, transductive ZSL (T-ZSL) assumes that unseen classes and their associated images are known during training, but not their correspondence. As current T-ZSL approaches do not scale efficiently when the number of seen classes is high, we tackle this problem with a new model for T-ZSL based upon CycleGAN. Our model jointly (i) projects images onto their seen class labels with a supervised objective and (ii) aligns unseen class labels and visual exemplars with adversarial and cycle-consistency objectives. We show the efficiency of our Cross-Modal CycleGAN model (CM-GAN) on the ImageNet T-ZSL task, where we obtain state-of-the-art results. We further validate CM-GAN on a language grounding task, and on a new task that we propose: zero-shot sentence-to-image matching on MS COCO.
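The cycle-consistency idea mentioned in the abstract can be sketched in a few lines. This is a generic illustration and not the CM-GAN implementation: the two mappings are toy functions standing in for learned networks, the L1 penalty is an assumption, and all names are hypothetical.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(images, labels, to_label, to_image):
    """Penalize round trips that do not return to their starting point.

    images, labels: lists of feature vectors from the two modalities.
    to_label, to_image: stand-ins for the learned cross-modal mappings.
    """
    image_term = sum(l1(to_image(to_label(v)), v) for v in images) / len(images)
    label_term = sum(l1(to_label(to_image(t)), t) for t in labels) / len(labels)
    return image_term + label_term


# Toy mappings: scale_up and scale_down are exact inverses; shift is not.
def scale_up(v):
    return [2.0 * x for x in v]


def scale_down(v):
    return [0.5 * x for x in v]


def shift(v):
    return [x + 1.0 for x in v]


images = [[1.0, 2.0]]
labels = [[0.0, 4.0]]

consistent = cycle_consistency_loss(images, labels, scale_up, scale_down)
inconsistent = cycle_consistency_loss(images, labels, scale_up, shift)
print(consistent, inconsistent)
```

The loss is zero only when the two mappings invert each other on the given data, which is what lets unpaired images and labels constrain each other without known correspondences.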
Transductive Zero-Shot Learning with Visual Structure Constraint
Wan, Ziyu, Chen, Dongdong, Li, Yan, Yan, Xingguang, Zhang, Junge, Yu, Yizhou, Liao, Jing
To recognize objects of the unseen classes, most existing Zero-Shot Learning (ZSL) methods first learn a compatible projection function between the common semantic space and the visual space based on the data of source seen classes, then directly apply it to the target unseen classes. However, in real scenarios, the data distributions of the source and target domains might not match well, which causes the well-known domain shift problem. Based on the observation that visual features of test instances can be separated into different clusters, we propose a new visual structure constraint on class centers for transductive ZSL, to improve the generality of the projection function (i.e., alleviate the above domain shift problem). Specifically, three different strategies (symmetric Chamfer distance, bipartite matching distance, and Wasserstein distance) are adopted to align the projected unseen semantic centers and visual cluster centers of test instances. We also propose a new training strategy to handle the real cases where many unrelated images exist in the test dataset, which is not considered in previous methods.